Artificial Intelligence Nanodegree

Voice User Interfaces

Project: Speech Recognition with Neural Networks


In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with '(IMPLEMENTATION)' in the header indicate that the following blocks of code will require additional functionality which you must provide. Please be sure to read the instructions carefully!

Note: Once you have completed all of the code implementations, you need to finalize your work by exporting the Jupyter Notebook as an HTML document. Before exporting the notebook to HTML, all of the code cells need to have been run so that reviewers can see the final implementation and output. You can then export the notebook by using the menu above and navigating to File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question X' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. Markdown cells can be edited by double-clicking the cell to enter edit mode.

The rubric contains optional "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. If you decide to pursue the "Stand Out Suggestions", you should include the code in this Jupyter notebook.


Introduction

In this notebook, you will build a deep neural network that functions as part of an end-to-end automatic speech recognition (ASR) pipeline! Your completed pipeline will accept raw audio as input and return a predicted transcription of the spoken language. The full pipeline is summarized in the figure below.

  • STEP 1 is a pre-processing step that converts raw audio to one of two feature representations that are commonly used for ASR.
  • STEP 2 is an acoustic model which accepts audio features as input and returns a probability distribution over all potential transcriptions. After learning about the basic types of neural networks that are often used for acoustic modeling, you will engage in your own investigations to design your own acoustic model!
  • STEP 3 in the pipeline takes the output from the acoustic model and returns a predicted transcription.


The Data

We begin by investigating the dataset that will be used to train and evaluate your pipeline. LibriSpeech is a large corpus of read English speech, designed for training and evaluating models for ASR. The dataset contains 1000 hours of speech derived from audiobooks. We will work with a small subset in this project, since training on larger amounts of data would take a long time. However, after completing this project, if you are interested in exploring further, you are encouraged to work with more of the data that is provided online.

In the code cells below, you will use the vis_train_features module to visualize a training example. The supplied argument index=0 tells the module to extract the first example in the training set. (You are welcome to change index=0 to point to a different training example, if you like, but please DO NOT amend any other code in the cell.) The returned variables are:

  • vis_text - transcribed text (label) for the training example.
  • vis_raw_audio - raw audio waveform for the training example.
  • vis_mfcc_feature - mel-frequency cepstral coefficients (MFCCs) for the training example.
  • vis_spectrogram_feature - spectrogram for the training example.
  • vis_audio_path - the file path to the training example.

In [2]:
from data_generator import vis_train_features

# extract label and audio features for a single training example
vis_text, vis_raw_audio, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path = vis_train_features()


There are 2136 total training examples.

The following code cell visualizes the audio waveform for your chosen example, along with the corresponding transcript. You also have the option to play the audio in the notebook!


In [5]:
from IPython.display import Markdown, display
from data_generator import vis_train_features, plot_raw_audio
from IPython.display import Audio
%matplotlib inline

# plot audio signal
plot_raw_audio(vis_raw_audio)
# print length of audio signal
display(Markdown('**Shape of Audio Signal** : ' + str(vis_raw_audio.shape)))
# print transcript corresponding to audio clip
display(Markdown('**Transcript** : ' + str(vis_text)))
# play the audio file
Audio(vis_audio_path)


Shape of Audio Signal : (103966,)

Transcript : the last two days of the voyage bartley found almost intolerable

Out[5]:

STEP 1: Acoustic Features for Speech Recognition

For this project, you won't use the raw audio waveform as input to your model. Instead, we provide code that first performs a pre-processing step to convert the raw audio to a feature representation that has historically proven successful for ASR models. Your acoustic model will accept the feature representation as input.

In this project, you will explore two possible feature representations. After completing the project, if you'd like to read more about deep learning architectures that can accept raw audio input, you are encouraged to explore this research paper.

Spectrograms

The first option for an audio feature representation is the spectrogram. In order to complete this project, you will not need to dig deeply into the details of how a spectrogram is calculated; but, if you are curious, the code for calculating the spectrogram was borrowed from this repository. The implementation appears in the utils.py file in your repository.

The code that we give you returns the spectrogram as a 2D tensor, where the first (vertical) dimension indexes time, and the second (horizontal) dimension indexes frequency. To speed the convergence of your algorithm, we have also normalized the spectrogram. (You can see this quickly in the visualization below by noting that the mean value hovers around zero, and most entries in the tensor assume values close to zero.)
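
If you are curious how the normalization works, the idea is simple: subtract the mean of the features and divide by their standard deviation, where the statistics are estimated from a sample of the training set. The snippet below is only a rough sketch of that idea (feats_mean and feats_std are assumed to come from the supplied data generator), not the exact supplied implementation.

def normalize_feature(feature, feats_mean, feats_std, eps=1e-14):
    # scale a (time, frequency) feature array to roughly zero mean and unit variance;
    # eps guards against division by zero for near-constant features
    return (feature - feats_mean) / (feats_std + eps)

# e.g. normalized = normalize_feature(raw_spectrogram, feats_mean, feats_std)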


In [6]:
from data_generator import plot_spectrogram_feature

# plot normalized spectrogram
plot_spectrogram_feature(vis_spectrogram_feature)
# print shape of spectrogram
display(Markdown('**Shape of Spectrogram** : ' + str(vis_spectrogram_feature.shape)))


Shape of Spectrogram : (470, 161)

Mel-Frequency Cepstral Coefficients (MFCCs)

The second option for an audio feature representation is MFCCs. You do not need to dig deeply into the details of how MFCCs are calculated, but if you would like more information, you are welcome to peruse the documentation of the python_speech_features Python package. Just as with the spectrogram features, the MFCCs are normalized in the supplied code.

The main idea behind MFCC features is the same as spectrogram features: at each time window, the MFCC feature yields a feature vector that characterizes the sound within the window. Note that the MFCC feature is much lower-dimensional than the spectrogram feature, which could help an acoustic model to avoid overfitting to the training dataset.
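
For reference, the snippet below sketches how 13-dimensional MFCCs can be computed directly with the python_speech_features package; this is only an illustration of the idea (the supplied code wraps a similar call and additionally normalizes the result), using the audio path loaded earlier.

from python_speech_features import mfcc
from scipy.io import wavfile

# load the raw waveform, then compute one 13-dimensional MFCC vector per time window
sample_rate, signal = wavfile.read(vis_audio_path)
mfcc_feat = mfcc(signal, samplerate=sample_rate, numcep=13)
print(mfcc_feat.shape)  # (number of time windows, 13)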


In [7]:
from data_generator import plot_mfcc_feature

# plot normalized MFCC
plot_mfcc_feature(vis_mfcc_feature)
# print shape of MFCC
display(Markdown('**Shape of MFCC** : ' + str(vis_mfcc_feature.shape)))


Shape of MFCC : (470, 13)

When you construct your pipeline, you will be able to choose to use either spectrogram or MFCC features. If you would like to see different implementations that make use of MFCCs and/or spectrograms, please check out the links below:

STEP 2: Deep Neural Networks for Acoustic Modeling

In this section, you will experiment with various neural network architectures for acoustic modeling.

You will begin by training five relatively simple architectures. Model 0 is provided for you. You will write code to implement Models 1, 2, 3, and 4. If you would like to experiment further, you are welcome to create and train more models under the Models 5+ heading.

All models will be specified in the sample_models.py file. After importing the sample_models module, you will train your architectures in the notebook.

After experimenting with the five simple architectures, you will have the opportunity to compare their performance. Based on your findings, you will construct a deeper architecture that is designed to outperform all of the shallow models.

For your convenience, we have designed the notebook so that each model can be specified and trained on separate occasions. That is, say you decide to take a break from the notebook after training Model 1. Then, you need not re-execute all prior code cells in the notebook before training Model 2. You need only re-execute the code cell below that is marked with RUN THIS CODE CELL IF YOU ARE RESUMING THE NOTEBOOK AFTER A BREAK, before moving on to the code cells corresponding to Model 2.


In [1]:
#####################################################################
# RUN THIS CODE CELL IF YOU ARE RESUMING THE NOTEBOOK AFTER A BREAK #
#####################################################################

# allocate 50% of GPU memory (if you like, feel free to change this)
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf 
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
set_session(tf.Session(config=config))

# watch for any changes in the sample_models module, and reload it automatically
%load_ext autoreload
%autoreload 2
# import NN architectures for speech recognition
from sample_models import *
# import function for training acoustic model
from train_utils import train_model


Using TensorFlow backend.

Model 0: RNN

Given their effectiveness in modeling sequential data, the first acoustic model you will use is an RNN. As shown in the figure below, the RNN we supply to you will take the time sequence of audio features as input.

At each time step, the speaker pronounces one of 28 possible characters, including each of the 26 letters in the English alphabet, along with a space character (" "), and an apostrophe (').

The output of the RNN at each time step is a vector of probabilities with 29 entries, where the $i$-th entry encodes the probability that the $i$-th character is spoken in the time sequence. (The extra 29th character is an empty "character" used to pad training examples within batches containing uneven lengths.) If you would like to peek under the hood at how characters are mapped to indices in the probability vector, look at the char_map.py file in the repository. The figure below shows an equivalent, rolled depiction of the RNN that shows the output layer in greater detail.
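
As a small illustration of the kind of mapping defined in char_map.py (the exact indices used by the project live in that file), a character/index mapping could look like the sketch below, with the extra index reserved for the empty "character".

# illustrative character/index mapping; the project's actual mapping is in char_map.py
characters = "' abcdefghijklmnopqrstuvwxyz"      # 28 characters: apostrophe, space, a-z
char_to_int = {ch: i for i, ch in enumerate(characters)}
int_to_char = {i: ch for ch, i in char_to_int.items()}
BLANK_INDEX = len(characters)                    # the extra 29th "empty" character

print(char_to_int['a'], int_to_char[1], BLANK_INDEX)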

The model has already been specified for you in Keras. To import it, you need only run the code cell below.


In [23]:
model_0 = simple_rnn_model(input_dim=161) # change to 13 if you would like to use MFCC features


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
rnn (GRU)                    (None, None, 29)          16617     
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 16,617
Trainable params: 16,617
Non-trainable params: 0
_________________________________________________________________
None

As explored in the lesson, you will train the acoustic model with the CTC loss criterion. Custom loss functions take a bit of hacking in Keras, and so we have implemented the CTC loss function for you, so that you can focus on trying out as many deep learning architectures as possible :). If you'd like to peek at the implementation details, look at the add_ctc_loss function within the train_utils.py file in the repository.

To train your architecture, you will use the train_model function within the train_utils module; it has already been imported in one of the above code cells. The train_model function takes three required arguments:

  • input_to_softmax - a Keras model instance.
  • pickle_path - the name of the pickle file where the loss history will be saved.
  • save_model_path - the name of the HDF5 file where the model will be saved.

If we have already supplied values for input_to_softmax, pickle_path, and save_model_path, please DO NOT modify these values.

There are several optional arguments that allow you to have more control over the training process. You are welcome to, but not required to, supply your own values for these arguments.

  • minibatch_size - the size of the minibatches that are generated while training the model (default: 20).
  • spectrogram - Boolean value dictating whether spectrogram (True) or MFCC (False) features are used for training (default: True).
  • mfcc_dim - the size of the feature dimension to use when generating MFCC features (default: 13).
  • optimizer - the Keras optimizer used to train the model (default: SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5)).
  • epochs - the number of epochs to use to train the model (default: 20). If you choose to modify this parameter, make sure that it is at least 20.
  • verbose - controls the verbosity of the training output in the model.fit_generator method (default: 1).
  • sort_by_duration - Boolean value dictating whether the training and validation sets are sorted by (increasing) duration before the start of the first epoch (default: False).

The train_model function defaults to using spectrogram features; if you choose to use these features, note that the acoustic model in simple_rnn_model should have input_dim=161. Otherwise, if you choose to use MFCC features, the acoustic model should have input_dim=13.
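
For illustration only, a call that overrides some of the optional arguments might look like the snippet below; the file names and hyperparameter values here are made up, and the defaults listed above are perfectly acceptable.

from keras.optimizers import SGD

train_model(input_to_softmax=simple_rnn_model(input_dim=13),  # MFCC features
            pickle_path='model_0_mfcc.pickle',                # illustrative file names
            save_model_path='model_0_mfcc.h5',
            spectrogram=False,                                # train on MFCC features
            minibatch_size=20,
            optimizer=SGD(lr=0.01, momentum=0.9, nesterov=True, clipnorm=5),
            epochs=20,
            sort_by_duration=True)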

We have chosen to use GRU units in the supplied RNN. If you would like to experiment with LSTM or SimpleRNN cells, feel free to do so here. If you change the GRU units to SimpleRNN cells in simple_rnn_model, you may notice that the loss quickly becomes undefined (nan) - you are strongly encouraged to check this for yourself! This is due to the exploding gradients problem. We have already implemented gradient clipping in your optimizer to help you avoid this issue.

IMPORTANT NOTE: If you notice that your gradient has exploded in any of the models below, feel free to explore more with gradient clipping (the clipnorm argument in your optimizer) or swap out any SimpleRNN cells for LSTM or GRU cells. You can also try restarting the kernel to restart the training process.


In [24]:
train_model(input_to_softmax=model_0, 
            pickle_path='model_0.pickle', 
            save_model_path='model_0.h5',
            spectrogram=True) # change to False if you would like to use MFCC features


Epoch 1/20
106/106 [==============================] - 216s - loss: 845.3553 - val_loss: 740.8339
Epoch 2/20
106/106 [==============================] - 216s - loss: 759.6700 - val_loss: 726.9334
Epoch 3/20
106/106 [==============================] - 216s - loss: 751.8939 - val_loss: 728.6634
Epoch 4/20
106/106 [==============================] - 215s - loss: 752.7703 - val_loss: 718.5636
Epoch 5/20
106/106 [==============================] - 216s - loss: 752.5393 - val_loss: 734.2006
Epoch 6/20
106/106 [==============================] - 217s - loss: 752.4897 - val_loss: 719.5030
Epoch 7/20
106/106 [==============================] - 216s - loss: 752.5890 - val_loss: 731.4544
Epoch 8/20
106/106 [==============================] - 215s - loss: 752.3679 - val_loss: 728.6592
Epoch 9/20
106/106 [==============================] - 214s - loss: 751.3841 - val_loss: 724.1845
Epoch 10/20
106/106 [==============================] - 217s - loss: 752.6718 - val_loss: 728.0512
Epoch 11/20
106/106 [==============================] - 217s - loss: 752.3170 - val_loss: 727.5446
Epoch 12/20
106/106 [==============================] - 216s - loss: 751.7902 - val_loss: 722.8978
Epoch 13/20
106/106 [==============================] - 215s - loss: 752.1688 - val_loss: 720.5606
Epoch 14/20
106/106 [==============================] - 216s - loss: 752.1740 - val_loss: 734.4282
Epoch 15/20
106/106 [==============================] - 218s - loss: 752.1190 - val_loss: 724.9952
Epoch 16/20
106/106 [==============================] - 216s - loss: 751.7720 - val_loss: 722.3962
Epoch 17/20
106/106 [==============================] - 216s - loss: 752.1447 - val_loss: 729.3465
Epoch 18/20
106/106 [==============================] - 215s - loss: 752.0579 - val_loss: 721.3646
Epoch 19/20
106/106 [==============================] - 215s - loss: 752.0178 - val_loss: 724.0728
Epoch 20/20
106/106 [==============================] - 217s - loss: 752.5739 - val_loss: 727.3449

(IMPLEMENTATION) Model 1: RNN + TimeDistributed Dense

Read about the TimeDistributed wrapper and the BatchNormalization layer in the Keras documentation. For your next architecture, you will add batch normalization to the recurrent layer to reduce training times. The TimeDistributed layer will be used to find more complex patterns in the dataset. The unrolled snapshot of the architecture is depicted below.

The next figure shows an equivalent, rolled depiction of the RNN that shows the (TimeDistributed) dense and output layers in greater detail.

Use your research to complete the rnn_model function within the sample_models.py file. The function should specify an architecture that satisfies the following requirements (one possible implementation is sketched after the list):

  • The first layer of the neural network should be an RNN (SimpleRNN, LSTM, or GRU) that takes the time sequence of audio features as input. We have added GRU units for you, but feel free to change GRU to SimpleRNN or LSTM, if you like!
  • Whereas the architecture in simple_rnn_model treated the RNN output as the final layer of the model, you will use the output of your RNN as a hidden layer. Use TimeDistributed to apply a Dense layer to each of the time steps in the RNN output. Ensure that each Dense layer has output_dim units.
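
As a reference point only, one way to satisfy these requirements is sketched below; it assumes the function signature (input_dim, units, activation, output_dim=29) used by the starter code, and your own implementation in sample_models.py may differ.

from keras.models import Model
from keras.layers import Input, GRU, BatchNormalization, TimeDistributed, Dense, Activation

def rnn_model(input_dim, units, activation, output_dim=29):
    # the time sequence of audio features is the model input
    input_data = Input(name='the_input', shape=(None, input_dim))
    # recurrent layer over the sequence, returning an output at every time step
    simp_rnn = GRU(units, activation=activation, return_sequences=True,
                   implementation=2, name='rnn')(input_data)
    # batch normalization to speed up training
    bn_rnn = BatchNormalization()(simp_rnn)
    # a Dense layer applied to every time step, followed by a softmax over the 29 characters
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: x   # recurrent layers preserve temporal length
    print(model.summary())
    return model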

Use the code cell below to load your model into the model_1 variable. Use a value for input_dim that matches your chosen audio features, and feel free to change the values for units and activation to tweak the behavior of your recurrent layer.


In [3]:
model_1 = rnn_model(input_dim=161, # change to 13 if you would like to use MFCC features
                    units=200,
                    activation='relu')


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
rnn (GRU)                    (None, None, 200)         217200    
_________________________________________________________________
batch_normalization_2 (Batch (None, None, 200)         800       
_________________________________________________________________
time_distributed_2 (TimeDist (None, None, 29)          5829      
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 223,829
Trainable params: 223,429
Non-trainable params: 400
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_1.h5. The loss history is saved in model_1.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.


In [4]:
train_model(input_to_softmax=model_1, 
            pickle_path='model_1.pickle', 
            save_model_path='model_1.h5',
            spectrogram=True) # change to False if you would like to use MFCC features


Epoch 1/20
106/106 [==============================] - 222s - loss: 306.9097 - val_loss: 307.4821
Epoch 2/20
106/106 [==============================] - 223s - loss: 210.9115 - val_loss: 197.3319
Epoch 3/20
106/106 [==============================] - 221s - loss: 178.0442 - val_loss: 172.3065
Epoch 4/20
106/106 [==============================] - 220s - loss: 161.9230 - val_loss: 158.0782
Epoch 5/20
106/106 [==============================] - 219s - loss: 151.8339 - val_loss: 154.2353
Epoch 6/20
106/106 [==============================] - 219s - loss: 144.7893 - val_loss: 146.0502
Epoch 7/20
106/106 [==============================] - 219s - loss: 138.9369 - val_loss: 144.1776
Epoch 8/20
106/106 [==============================] - 218s - loss: 134.7637 - val_loss: 139.6012
Epoch 9/20
106/106 [==============================] - 220s - loss: 130.9328 - val_loss: 140.9713
Epoch 10/20
106/106 [==============================] - 220s - loss: 128.1127 - val_loss: 136.3366
Epoch 11/20
106/106 [==============================] - 219s - loss: 124.8036 - val_loss: 135.5164
Epoch 12/20
106/106 [==============================] - 218s - loss: 122.2726 - val_loss: 133.2287
Epoch 13/20
106/106 [==============================] - 219s - loss: 121.2614 - val_loss: 135.3828
Epoch 14/20
106/106 [==============================] - 218s - loss: 118.4146 - val_loss: 135.9902
Epoch 15/20
106/106 [==============================] - 219s - loss: 116.9746 - val_loss: 132.2172
Epoch 16/20
106/106 [==============================] - 221s - loss: 115.3054 - val_loss: 130.5474
Epoch 17/20
106/106 [==============================] - 220s - loss: 113.6723 - val_loss: 130.6809
Epoch 18/20
106/106 [==============================] - 218s - loss: 113.4530 - val_loss: 131.6026
Epoch 19/20
106/106 [==============================] - 220s - loss: 112.2385 - val_loss: 137.1105
Epoch 20/20
106/106 [==============================] - 220s - loss: 114.1109 - val_loss: 134.0142

(IMPLEMENTATION) Model 2: CNN + RNN + TimeDistributed Dense

The architecture in cnn_rnn_model adds an additional level of complexity by introducing a 1D convolution layer.

This layer incorporates many arguments that can be (optionally) tuned when calling the cnn_rnn_model function. We provide sample starting parameters, which you might find useful if you choose to use spectrogram audio features.

If you instead want to use MFCC features, these arguments will have to be tuned. Note that the current architecture only supports values of 'same' or 'valid' for the conv_border_mode argument.

When tuning the parameters, be careful not to choose settings that make the convolutional layer overly small. If the temporal length of the CNN layer is shorter than the length of the transcribed text label, your code will throw an error.

Before running the code cell below, you must modify the cnn_rnn_model function in sample_models.py. Please add batch normalization to the recurrent layer, and provide the same TimeDistributed layer as before.
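
One possible completion of cnn_rnn_model is sketched below for reference; it assumes the starter code's signature and its cnn_output_length helper, and adds the batch-normalized GRU and TimeDistributed dense layer described above. Your implementation may of course differ.

from keras.models import Model
from keras.layers import (Input, Conv1D, GRU, BatchNormalization,
                          TimeDistributed, Dense, Activation)

def cnn_rnn_model(input_dim, filters, kernel_size, conv_stride,
                  conv_border_mode, units, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # 1D convolution over the time dimension of the audio features
    conv_1d = Conv1D(filters, kernel_size, strides=conv_stride,
                     padding=conv_border_mode, activation='relu',
                     name='conv1d')(input_data)
    bn_cnn = BatchNormalization(name='bn_conv_1d')(conv_1d)
    # recurrent layer with batch normalization
    simp_rnn = GRU(units, activation='relu', return_sequences=True,
                   implementation=2, name='rnn')(bn_cnn)
    bn_rnn = BatchNormalization(name='bn_simp_rnn')(simp_rnn)
    # Dense layer applied to every time step, then a softmax over the characters
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    model = Model(inputs=input_data, outputs=y_pred)
    # the convolution changes the temporal length of its input
    model.output_length = lambda x: cnn_output_length(
        x, kernel_size, conv_border_mode, conv_stride)
    print(model.summary())
    return model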


In [2]:
model_2 = cnn_rnn_model(input_dim=161, # change to 13 if you would like to use MFCC features
                        filters=200,
                        kernel_size=11, 
                        conv_stride=2,
                        conv_border_mode='valid',
                        units=200)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400    
_________________________________________________________________
bn_conv_1d (BatchNormalizati (None, None, 200)         800       
_________________________________________________________________
rnn (GRU)                    (None, None, 200)         240600    
_________________________________________________________________
bn_simp_rnn (BatchNormalizat (None, None, 200)         800       
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 29)          5829      
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 602,429
Trainable params: 601,629
Non-trainable params: 800
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_2.h5. The loss history is saved in model_2.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.


In [3]:
train_model(input_to_softmax=model_2, 
            pickle_path='model_2.pickle', 
            save_model_path='model_2.h5',
            spectrogram=True) # change to False if you would like to use MFCC features


Epoch 1/20
106/106 [==============================] - 118s - loss: 247.7437 - val_loss: 231.3876
Epoch 2/20
106/106 [==============================] - 116s - loss: 180.2408 - val_loss: 164.4583
Epoch 3/20
106/106 [==============================] - 116s - loss: 151.2081 - val_loss: 150.5839
Epoch 4/20
106/106 [==============================] - 115s - loss: 137.8753 - val_loss: 139.5028
Epoch 5/20
106/106 [==============================] - 115s - loss: 128.3218 - val_loss: 137.2442
Epoch 6/20
106/106 [==============================] - 114s - loss: 121.1707 - val_loss: 133.7132
Epoch 7/20
106/106 [==============================] - 114s - loss: 114.7692 - val_loss: 130.5362
Epoch 8/20
106/106 [==============================] - 113s - loss: 109.1703 - val_loss: 127.1819
Epoch 9/20
106/106 [==============================] - 114s - loss: 103.9699 - val_loss: 129.7073
Epoch 10/20
106/106 [==============================] - 113s - loss: 99.2999 - val_loss: 126.4226
Epoch 11/20
106/106 [==============================] - 113s - loss: 94.5972 - val_loss: 127.6528
Epoch 12/20
106/106 [==============================] - 113s - loss: 90.5315 - val_loss: 127.8962
Epoch 13/20
106/106 [==============================] - 114s - loss: 86.2354 - val_loss: 133.0719
Epoch 14/20
106/106 [==============================] - 114s - loss: 82.6942 - val_loss: 131.5294
Epoch 15/20
106/106 [==============================] - 114s - loss: 78.8500 - val_loss: 137.1646
Epoch 16/20
106/106 [==============================] - 113s - loss: 74.7816 - val_loss: 135.9730
Epoch 17/20
106/106 [==============================] - 115s - loss: 71.7061 - val_loss: 140.6504
Epoch 18/20
106/106 [==============================] - 114s - loss: 68.3023 - val_loss: 141.4044
Epoch 19/20
106/106 [==============================] - 113s - loss: 64.9206 - val_loss: 144.4121
Epoch 20/20
106/106 [==============================] - 114s - loss: 62.1754 - val_loss: 148.5745

(IMPLEMENTATION) Model 3: Deeper RNN + TimeDistributed Dense

Review the code in rnn_model, which makes use of a single recurrent layer. Now, specify an architecture in deep_rnn_model that utilizes a variable number recur_layers of recurrent layers. The figure below shows the architecture that should be returned if recur_layers=2. In the figure, the output sequence of the first recurrent layer is used as input for the next recurrent layer.

Feel free to change the supplied values of units to whatever you think performs best. You can change the value of recur_layers, as long as your final value is greater than 1. (As a quick check that you have implemented the additional functionality in deep_rnn_model correctly, make sure that the architecture that you specify here is identical to rnn_model if recur_layers=1.)
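
For reference, one possible deep_rnn_model is sketched below; it assumes the same conventions as the earlier models and simply stacks recur_layers batch-normalized GRU layers. Your own implementation may differ.

from keras.models import Model
from keras.layers import Input, GRU, BatchNormalization, TimeDistributed, Dense, Activation

def deep_rnn_model(input_dim, units, recur_layers, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    layer = input_data
    # stack recur_layers recurrent layers, each followed by batch normalization
    for i in range(recur_layers):
        layer = GRU(units, activation='relu', return_sequences=True,
                    implementation=2, name='recurrent_rnn_{}'.format(i))(layer)
        layer = BatchNormalization(name='bn_recurrent_rnn_{}'.format(i))(layer)
    # Dense layer applied to every time step, then a softmax over the characters
    time_dense = TimeDistributed(Dense(output_dim))(layer)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: x
    print(model.summary())
    return model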


In [17]:
model_3 = deep_rnn_model(input_dim=161, # change to 13 if you would like to use MFCC features
                         units=200,
                         recur_layers=2)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
recurrent_rnn_0 (GRU)        (None, None, 200)         217200    
_________________________________________________________________
bn_recurrent_rnn_0 (BatchNor (None, None, 200)         800       
_________________________________________________________________
recurrent_rnn_1 (GRU)        (None, None, 200)         240600    
_________________________________________________________________
bn_recurrent_rnn_1 (BatchNor (None, None, 200)         800       
_________________________________________________________________
time_distributed_9 (TimeDist (None, None, 29)          5829      
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 465,229
Trainable params: 464,429
Non-trainable params: 800
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_3.h5. The loss history is saved in model_3.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.


In [18]:
train_model(input_to_softmax=model_3, 
            pickle_path='model_3.pickle', 
            save_model_path='model_3.h5', 
            spectrogram=True) # change to False if you would like to use MFCC features


Epoch 1/20
106/106 [==============================] - 374s - loss: 390.2969 - val_loss: 332.6760
Epoch 2/20
106/106 [==============================] - 379s - loss: 232.0555 - val_loss: 259.6133
Epoch 3/20
106/106 [==============================] - 378s - loss: 193.0228 - val_loss: 176.5208
Epoch 4/20
106/106 [==============================] - 377s - loss: 164.7649 - val_loss: 156.5154
Epoch 5/20
106/106 [==============================] - 378s - loss: 147.5944 - val_loss: 156.0791
Epoch 6/20
106/106 [==============================] - 378s - loss: 137.6360 - val_loss: 145.0914
Epoch 7/20
106/106 [==============================] - 378s - loss: 130.6393 - val_loss: 137.4015
Epoch 8/20
106/106 [==============================] - 375s - loss: 125.8150 - val_loss: 137.3835
Epoch 9/20
106/106 [==============================] - 378s - loss: 122.7980 - val_loss: 132.8275
Epoch 10/20
106/106 [==============================] - 378s - loss: 118.3026 - val_loss: 131.5892
Epoch 11/20
106/106 [==============================] - 378s - loss: 116.1538 - val_loss: 130.5838
Epoch 12/20
106/106 [==============================] - 378s - loss: 114.5137 - val_loss: 131.1469
Epoch 13/20
106/106 [==============================] - 379s - loss: 112.9381 - val_loss: 132.4737
Epoch 14/20
106/106 [==============================] - 380s - loss: 109.5681 - val_loss: 123.9704
Epoch 15/20
106/106 [==============================] - 377s - loss: 107.4095 - val_loss: 126.5864
Epoch 16/20
106/106 [==============================] - 378s - loss: 107.0830 - val_loss: 128.0773
Epoch 17/20
106/106 [==============================] - 378s - loss: 106.0536 - val_loss: 129.5277
Epoch 18/20
106/106 [==============================] - 378s - loss: 104.5133 - val_loss: 127.8968
Epoch 19/20
106/106 [==============================] - 376s - loss: 102.6818 - val_loss: 124.9978
Epoch 20/20
106/106 [==============================] - 380s - loss: 100.6665 - val_loss: 126.7010

(IMPLEMENTATION) Model 4: Bidirectional RNN + TimeDistributed Dense

Read about the Bidirectional wrapper in the Keras documentation. For your next architecture, you will specify an architecture that uses a single bidirectional RNN layer, before a (TimeDistributed) dense layer. The added value of a bidirectional RNN is described well in this paper.

One shortcoming of conventional RNNs is that they are only able to make use of previous context. In speech recognition, where whole utterances are transcribed at once, there is no reason not to exploit future context as well. Bidirectional RNNs (BRNNs) do this by processing the data in both directions with two separate hidden layers which are then fed forwards to the same output layer.

Before running the code cell below, you must complete the bidirectional_rnn_model function in sample_models.py. Feel free to use SimpleRNN, LSTM, or GRU units. When specifying the Bidirectional wrapper, use merge_mode='concat'.
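
For reference, a minimal bidirectional_rnn_model might look like the sketch below, assuming the same conventions as the models above and the required merge_mode='concat'; your own implementation may differ.

from keras.models import Model
from keras.layers import Input, GRU, Bidirectional, TimeDistributed, Dense, Activation

def bidirectional_rnn_model(input_dim, units, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # two GRUs process the sequence in opposite directions; their outputs are concatenated
    bidir_rnn = Bidirectional(GRU(units, return_sequences=True, implementation=2),
                              merge_mode='concat')(input_data)
    # Dense layer applied to every time step, then a softmax over the characters
    time_dense = TimeDistributed(Dense(output_dim))(bidir_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: x
    print(model.summary())
    return model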


In [25]:
model_4 = bidirectional_rnn_model(input_dim=161, # change to 13 if you would like to use MFCC features
                                  units=200)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
bidirectional_9 (Bidirection (None, None, 400)         434400    
_________________________________________________________________
time_distributed_13 (TimeDis (None, None, 29)          11629     
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 446,029
Trainable params: 446,029
Non-trainable params: 0
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_4.h5. The loss history is saved in model_4.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.


In [26]:
train_model(input_to_softmax=model_4, 
            pickle_path='model_4.pickle', 
            save_model_path='model_4.h5', 
            spectrogram=True) # change to False if you would like to use MFCC features


Epoch 1/20
106/106 [==============================] - 365s - loss: 286.6463 - val_loss: 212.6980
Epoch 2/20
106/106 [==============================] - 369s - loss: 206.9594 - val_loss: 194.4766
Epoch 3/20
106/106 [==============================] - 370s - loss: 191.6510 - val_loss: 184.1911
Epoch 4/20
106/106 [==============================] - 367s - loss: 182.0327 - val_loss: 176.5197
Epoch 5/20
106/106 [==============================] - 370s - loss: 173.4004 - val_loss: 169.5175
Epoch 6/20
106/106 [==============================] - 368s - loss: 166.8569 - val_loss: 170.2993
Epoch 7/20
106/106 [==============================] - 369s - loss: 160.3366 - val_loss: 159.0140
Epoch 8/20
106/106 [==============================] - 369s - loss: 153.7445 - val_loss: 160.8792
Epoch 9/20
106/106 [==============================] - 369s - loss: 147.8207 - val_loss: 151.1923
Epoch 10/20
106/106 [==============================] - 367s - loss: 142.1783 - val_loss: 148.3007
Epoch 11/20
106/106 [==============================] - 371s - loss: 137.6096 - val_loss: 143.4802
Epoch 12/20
106/106 [==============================] - 369s - loss: 133.5550 - val_loss: 143.5125
Epoch 13/20
106/106 [==============================] - 369s - loss: 129.7062 - val_loss: 141.7700
Epoch 14/20
106/106 [==============================] - 369s - loss: 126.0751 - val_loss: 136.4081
Epoch 15/20
106/106 [==============================] - 368s - loss: 122.4594 - val_loss: 136.1303
Epoch 16/20
106/106 [==============================] - 369s - loss: 119.2986 - val_loss: 137.2347
Epoch 17/20
106/106 [==============================] - 371s - loss: 116.4967 - val_loss: 139.5178
Epoch 18/20
106/106 [==============================] - 369s - loss: 113.4465 - val_loss: 136.2991
Epoch 19/20
106/106 [==============================] - 369s - loss: 110.9780 - val_loss: 134.7229
Epoch 20/20
106/106 [==============================] - 370s - loss: 108.5471 - val_loss: 133.7598

(OPTIONAL IMPLEMENTATION) Models 5+

If you would like to try out more architectures than the ones above, please use the code cell below. Please continue to follow the same convention for saving the models; for the $i$-th sample model, please save the loss history at model_i.pickle and the trained model at model_i.h5.


In [ ]:
## (Optional) TODO: Try out some more models!
### Feel free to use as many code cells as needed.

Compare the Models

Execute the code cell below to evaluate the performance of the drafted deep learning models. The training and validation loss are plotted for each model.


In [4]:
from glob import glob
import numpy as np
import _pickle as pickle
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style(style='white')

# obtain the paths for the saved model history
all_pickles = sorted(glob("results/*.pickle"))
# extract the name of each model
model_names = [item[8:-7] for item in all_pickles]
# extract the loss history for each model
valid_loss = [pickle.load( open( i, "rb" ) )['val_loss'] for i in all_pickles]
train_loss = [pickle.load( open( i, "rb" ) )['loss'] for i in all_pickles]
# save the number of epochs used to train each model
num_epochs = [len(valid_loss[i]) for i in range(len(valid_loss))]

fig = plt.figure(figsize=(16,5))

# plot the training loss vs. epoch for each model
ax1 = fig.add_subplot(121)
for i in range(len(all_pickles)):
    ax1.plot(np.linspace(1, num_epochs[i], num_epochs[i]), 
            train_loss[i], label=model_names[i])
# clean up the plot
ax1.legend()  
ax1.set_xlim([1, max(num_epochs)])
plt.xlabel('Epoch')
plt.ylabel('Training Loss')

# plot the validation loss vs. epoch for each model
ax2 = fig.add_subplot(122)
for i in range(len(all_pickles)):
    ax2.plot(np.linspace(1, num_epochs[i], num_epochs[i]), 
            valid_loss[i], label=model_names[i])
# clean up the plot
ax2.legend()  
ax2.set_xlim([1, max(num_epochs)])
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.show()


Question 1: Use the plot above to analyze the performance of each of the attempted architectures. Which performs best? Provide an explanation regarding why you think some models perform better than others.

Answer: We can observe that model_2 (CNN + RNN + TimeDistributed Dense) has the best performance on the training set, with a training loss of 62.1754 after 20 epochs. However, looking at the validation loss, we can see that it starts overfitting: the validation loss begins to increase around epoch 10. There are ways to handle overfitting (e.g., early stopping or adding dropout), so this can still be a very good model to start with.

One of the reasons this model performs well is that the spectrogram is an image-like, two-dimensional representation of the speech signal, and CNNs are generally very good at extracting local patterns from this kind of data.

I initially had an exploding gradient issue with model_2, and my loss values were NaN; I was using SimpleRNN cells at the time. After switching to GRU cells and restarting the kernel, the exploding gradient issue disappeared.

All other models except model_0 behaved reasonably well, the second best being model_3, with its two recurrent layers. Its training loss after 20 epochs is 100.6665 and its validation loss is 126.7010. The difference between them is acceptable, and the model neither overfits nor underfits the training set.

(IMPLEMENTATION) Final Model

Now that you've tried out many sample models, use what you've learned to draft your own architecture! While your final acoustic model should not be identical to any of the architectures explored above, you are welcome to merely combine the explored layers above into a deeper architecture. It is NOT necessary to include new layer types that were not explored in the notebook.

However, if you would like some ideas for even more layer types, check out these ideas for some additional, optional extensions to your model:

  • If you notice your model is overfitting to the training dataset, consider adding dropout! To add dropout to recurrent layers, pay special attention to the dropout_W and dropout_U arguments. This paper may also provide some interesting theoretical background.
  • If you choose to include a convolutional layer in your model, you may get better results by working with dilated convolutions. If you choose to use dilated convolutions, make sure that you are able to accurately calculate the length of the acoustic model's output in the model.output_length lambda function. You can read more about dilated convolutions in Google's WaveNet paper. For an example of a speech-to-text system that makes use of dilated convolutions, check out this GitHub repository. You can work with dilated convolutions in Keras by paying special attention to the padding argument when you specify a convolutional layer.
  • If your model makes use of convolutional layers, why not also experiment with adding max pooling? Check out this paper for example architecture that makes use of max pooling in an acoustic model.
  • So far, you have experimented with a single bidirectional RNN layer. Consider stacking the bidirectional layers, to produce a deep bidirectional RNN!

All models that you specify in this repository should have output_length defined as an attribute. This attribute is a lambda function that maps the (temporal) length of the input acoustic features to the (temporal) length of the output softmax layer. This function is used in the computation of CTC loss; to see this, look at the add_ctc_loss function in train_utils.py. To see where the output_length attribute is defined for the models in the code, take a look at the sample_models.py file. You will notice this line of code within most models:

model.output_length = lambda x: x

The acoustic model that incorporates a convolutional layer (cnn_rnn_model) has a line that is a bit different:

model.output_length = lambda x: cnn_output_length(
        x, kernel_size, conv_border_mode, conv_stride)

In the case of models that use purely recurrent layers, the lambda function is the identity function, as the recurrent layers do not modify the (temporal) length of their input tensors. However, convolutional layers are more complicated and require a specialized function (cnn_output_length in sample_models.py) to determine the temporal length of their output.
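
For reference, the length calculation performed by cnn_output_length amounts to the standard 1D convolution arithmetic; the sketch below paraphrases that idea under an illustrative name, and is not a drop-in replacement for the helper in sample_models.py.

def conv1d_output_length(input_length, kernel_size, border_mode, stride, dilation=1):
    # temporal length of a 1D convolution's output, for 'same' or 'valid' padding
    if input_length is None:
        return None
    dilated_kernel = kernel_size + (kernel_size - 1) * (dilation - 1)
    if border_mode == 'same':
        length = input_length
    else:  # 'valid'
        length = input_length - dilated_kernel + 1
    return (length + stride - 1) // stride   # ceiling division by the stride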

You will have to add the output_length attribute to your final model before running the code cell below. Feel free to use the cnn_output_length function, if it suits your model.


In [2]:
# specify the model
model_end = final_model(input_dim=161, # change to 13 if you would like to use MFCC features
                         units=200,
                         recur_layers=2,
                         filters=200,
                         kernel_size=11, 
                         conv_stride=2,
                         conv_border_mode='valid')


/home/ubuntu/AIND-VUI-Capstone/sample_models.py:176: UserWarning: Update your `GRU` call to the Keras 2 API: `GRU(200, return_sequences=True, name="recurrent_rnn_0", activation="relu", dropout=0.4, implementation=2, recurrent_dropout=0.4)`
  dropout_W=0.4, dropout_U=0.4)
/home/ubuntu/AIND-VUI-Capstone/sample_models.py:176: UserWarning: Update your `GRU` call to the Keras 2 API: `GRU(200, return_sequences=True, name="recurrent_rnn_1", activation="relu", dropout=0.4, implementation=2, recurrent_dropout=0.4)`
  dropout_W=0.4, dropout_U=0.4)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400    
_________________________________________________________________
bn_conv_1d (BatchNormalizati (None, None, 200)         800       
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 400)         481200    
_________________________________________________________________
bn_recurrent_rnn_0 (BatchNor (None, None, 400)         1600      
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 400)         721200    
_________________________________________________________________
bn_recurrent_rnn_1 (BatchNor (None, None, 400)         1600      
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 29)          11629     
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 1,572,429
Trainable params: 1,570,429
Non-trainable params: 2,000
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_end.h5. The loss history is saved in model_end.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.


In [3]:
train_model(input_to_softmax=model_end, 
            pickle_path='model_end.pickle', 
            save_model_path='model_end.h5', 
            spectrogram=True) # change to False if you would like to use MFCC features


Epoch 1/20
106/106 [==============================] - 349s - loss: 253.3962 - val_loss: 230.2944
Epoch 2/20
106/106 [==============================] - 351s - loss: 210.4323 - val_loss: 184.3782
Epoch 3/20
106/106 [==============================] - 346s - loss: 184.6225 - val_loss: 161.5775
Epoch 4/20
106/106 [==============================] - 346s - loss: 168.5587 - val_loss: 148.0438
Epoch 5/20
106/106 [==============================] - 343s - loss: 156.8022 - val_loss: 137.6756
Epoch 6/20
106/106 [==============================] - 343s - loss: 148.6631 - val_loss: 135.0179
Epoch 7/20
106/106 [==============================] - 342s - loss: 142.6012 - val_loss: 128.2467
Epoch 8/20
106/106 [==============================] - 341s - loss: 137.1569 - val_loss: 124.8640
Epoch 9/20
106/106 [==============================] - 342s - loss: 133.5040 - val_loss: 121.7355
Epoch 10/20
106/106 [==============================] - 341s - loss: 129.0078 - val_loss: 118.8663
Epoch 11/20
106/106 [==============================] - 342s - loss: 126.3363 - val_loss: 116.3399
Epoch 12/20
106/106 [==============================] - 343s - loss: 123.1024 - val_loss: 117.3443
Epoch 13/20
106/106 [==============================] - 344s - loss: 119.9809 - val_loss: 113.0286
Epoch 14/20
106/106 [==============================] - 344s - loss: 117.9996 - val_loss: 114.0767
Epoch 15/20
106/106 [==============================] - 344s - loss: 115.7128 - val_loss: 109.3831
Epoch 16/20
106/106 [==============================] - 344s - loss: 113.8068 - val_loss: 109.4694
Epoch 17/20
106/106 [==============================] - 344s - loss: 111.6350 - val_loss: 108.1214
Epoch 18/20
106/106 [==============================] - 341s - loss: 110.2124 - val_loss: 105.1850
Epoch 19/20
106/106 [==============================] - 344s - loss: 108.0657 - val_loss: 106.8586
Epoch 20/20
106/106 [==============================] - 340s - loss: 106.7270 - val_loss: 105.8960

Question 2: Describe your final model architecture and your reasoning at each step.

Answer: I saw that model_2 performed quite well, so I decided to use a CNN in my model and also to add dropout to the RNN layers to avoid overfitting the training set. I added dropout_W=0.4 and dropout_U=0.4 for each RNN layer.

I used stacked recurrent layers with BatchNormalization to reduce training times, and wrapped each recurrent layer in a Bidirectional wrapper to make use of both past and future context. I also used a TimeDistributed Dense layer to find more complex patterns in the data set.

From the initial 5 models, I saw that the above layers had good results and I decided to combine them. The results were somewhat surprising, because the validation loss was consistently lower than the training loss. The difference is not big and it shrinks with training. The reasons for this could be:

  • dropout (0.4 on both the layer inputs and the recurrent connections) is active only during training, so the training loss is computed on a handicapped network, while the validation loss is not
  • the data was not uniformly split between training and validation, with the training data containing examples with greater variation
  • the validation set is small and the model could still be underfitting, even though the training loss was good compared to the other models; we could further try lower dropout values, especially dropout_U, which is applied to the recurrent connections

STEP 3: Obtain Predictions

We have written a function for you to decode the predictions of your acoustic model. To use the function, please execute the code cell below.


In [4]:
import numpy as np
from data_generator import AudioGenerator
from keras import backend as K
from utils import int_sequence_to_text
from IPython.display import Audio

def get_predictions(index, partition, input_to_softmax, model_path):
    """ Print a model's decoded predictions
    Params:
        index (int): The example you would like to visualize
        partition (str): One of 'train' or 'validation'
        input_to_softmax (Model): The acoustic model
        model_path (str): Path to saved acoustic model's weights
    """
    # load the train and test data
    data_gen = AudioGenerator()
    data_gen.load_train_data()
    data_gen.load_validation_data()
    
    # obtain the true transcription and the audio features 
    if partition == 'validation':
        transcr = data_gen.valid_texts[index]
        audio_path = data_gen.valid_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    elif partition == 'train':
        transcr = data_gen.train_texts[index]
        audio_path = data_gen.train_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    else:
        raise Exception('Invalid partition!  Must be "train" or "validation"')
        
    # obtain and decode the acoustic model's predictions
    input_to_softmax.load_weights(model_path)
    prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
    output_length = [input_to_softmax.output_length(data_point.shape[0])] 
    pred_ints = (K.eval(K.ctc_decode(
                prediction, output_length)[0][0])+1).flatten().tolist()
    
    # play the audio file, and display the true and predicted transcriptions
    print('-'*80)
    Audio(audio_path)
    print('True transcription:\n' + '\n' + transcr)
    print('-'*80)
    print('Predicted transcription:\n' + '\n' + ''.join(int_sequence_to_text(pred_ints)))
    print('-'*80)

Use the code cell below to obtain the transcription predicted by your final model for the first example in the training dataset.


In [8]:
get_predictions(index=0, 
                partition='train',
                input_to_softmax=final_model(
                         input_dim=161, # change to 13 if you would like to use MFCC features
                         units=200,
                         recur_layers=2,
                         filters=200,
                         kernel_size=11, 
                         conv_stride=2,
                         conv_border_mode='valid'), 
                model_path='results/model_end.h5')


/home/ubuntu/AIND-VUI-Capstone/sample_models.py:176: UserWarning: Update your `GRU` call to the Keras 2 API: `GRU(200, return_sequences=True, name="recurrent_rnn_0", activation="relu", dropout=0.4, implementation=2, recurrent_dropout=0.4)`
  dropout_W=0.4, dropout_U=0.4)
/home/ubuntu/AIND-VUI-Capstone/sample_models.py:176: UserWarning: Update your `GRU` call to the Keras 2 API: `GRU(200, return_sequences=True, name="recurrent_rnn_1", activation="relu", dropout=0.4, implementation=2, recurrent_dropout=0.4)`
  dropout_W=0.4, dropout_U=0.4)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400    
_________________________________________________________________
bn_conv_1d (BatchNormalizati (None, None, 200)         800       
_________________________________________________________________
bidirectional_7 (Bidirection (None, None, 400)         481200    
_________________________________________________________________
bn_recurrent_rnn_0 (BatchNor (None, None, 400)         1600      
_________________________________________________________________
bidirectional_8 (Bidirection (None, None, 400)         721200    
_________________________________________________________________
bn_recurrent_rnn_1 (BatchNor (None, None, 400)         1600      
_________________________________________________________________
time_distributed_4 (TimeDist (None, None, 29)          11629     
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 1,572,429
Trainable params: 1,570,429
Non-trainable params: 2,000
_________________________________________________________________
None
--------------------------------------------------------------------------------
True transcription:

the last two days of the voyage bartley found almost intolerable
--------------------------------------------------------------------------------
Predicted transcription:

the los twodas of if oach bortly fond omest in tollrble
--------------------------------------------------------------------------------

Use the next code cell to visualize the model's prediction for the first example in the validation dataset.


In [11]:
get_predictions(index=0, 
                partition='validation',
                input_to_softmax=final_model(
                         input_dim=161, # change to 13 if you would like to use MFCC features
                         units=200,
                         recur_layers=2,
                         filters=200,
                         kernel_size=11, 
                         conv_stride=2,
                         conv_border_mode='valid'), 
                model_path='results/model_end.h5')


/home/ubuntu/AIND-VUI-Capstone/sample_models.py:176: UserWarning: Update your `GRU` call to the Keras 2 API: `GRU(200, return_sequences=True, name="recurrent_rnn_0", activation="relu", dropout=0.4, implementation=2, recurrent_dropout=0.4)`
  dropout_W=0.4, dropout_U=0.4)
/home/ubuntu/AIND-VUI-Capstone/sample_models.py:176: UserWarning: Update your `GRU` call to the Keras 2 API: `GRU(200, return_sequences=True, name="recurrent_rnn_1", activation="relu", dropout=0.4, implementation=2, recurrent_dropout=0.4)`
  dropout_W=0.4, dropout_U=0.4)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400    
_________________________________________________________________
bn_conv_1d (BatchNormalizati (None, None, 200)         800       
_________________________________________________________________
bidirectional_11 (Bidirectio (None, None, 400)         481200    
_________________________________________________________________
bn_recurrent_rnn_0 (BatchNor (None, None, 400)         1600      
_________________________________________________________________
bidirectional_12 (Bidirectio (None, None, 400)         721200    
_________________________________________________________________
bn_recurrent_rnn_1 (BatchNor (None, None, 400)         1600      
_________________________________________________________________
time_distributed_6 (TimeDist (None, None, 29)          11629     
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 1,572,429
Trainable params: 1,570,429
Non-trainable params: 2,000
_________________________________________________________________
None
--------------------------------------------------------------------------------
True transcription:

out in the woods stood a nice little fir tree
--------------------------------------------------------------------------------
Predicted transcription:

o an tho wod stot en ni cse thetal firtry
--------------------------------------------------------------------------------

One standard way to improve the results of the decoder is to incorporate a language model. We won't pursue this in the notebook, but you are welcome to do so as an optional extension.
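
If you do pursue that extension, one simple (purely illustrative) approach is to rescore a list of candidate transcriptions from a beam search by combining the acoustic (CTC) log-probability with a language-model score; lm_log_prob below is a stand-in for whatever scoring function you choose (e.g. an n-gram model).

def rescore(candidates, lm_log_prob, alpha=0.5, beta=0.1):
    # candidates: list of (transcript, acoustic_log_prob) pairs from a beam search
    # returns the transcript maximizing a weighted combination of the acoustic score,
    # the language-model score, and a word-count bonus
    def combined_score(item):
        transcript, acoustic_log_prob = item
        return (acoustic_log_prob
                + alpha * lm_log_prob(transcript)
                + beta * len(transcript.split()))
    return max(candidates, key=combined_score)[0]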

If you are interested in creating models that provide improved transcriptions, you are encouraged to download more data and train bigger, deeper models. But beware - the model will likely take a long while to train. For instance, training this state-of-the-art model would take 3-6 weeks on a single GPU!